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SUMMARY 


Fisher randomization tests for Neyman’s null hypothesis of no average treatment effects are 
considered in a finite population setting associated with completely randomized experiments 
with more than two treatments. The consequences of using the F’ statistic to conduct such a test 
are examined both theoretically and computationally, and it is argued that under treatment ef- 
fect heterogeneity, use of the F’ statistic in the Fisher randomization test can severely inflate the 
type I error under Neyman’s null hypothesis. An alternative test statistic is proposed, its asymp- 
totic distributions under Fisher’s and Neyman’s null hypotheses are derived, and its advantages 
demonstrated. 


Some key words: Additivity; Fisher randomization test; Null hypothesis; One-way layout 


1. INTRODUCTION 


One-way analysis of variance (Fisher, 1925) is arguably the most commonly used tool to an- 
alyze completely randomized experiments with more than two treatments. The standard F' test 
for testing equality of mean treatment effects can be justified either by assuming a linear addi- 
tive super population model with identically and independently distributed normal error terms, 
or by using the asymptotic randomization distribution of the F statistic. As observed by many 
experts, units in most real-life experiments are rarely random samples from a super population, 
making a finite population randomization-based perspective on inference important (e.g. Rosen- 
baum, 2010; Imbens & Rubin, 2015; Dasgupta et al., 2015). Fisher randomization tests are useful 
tools for such inference, because they pertain to a finite population of units, and assess the sta- 
tistical significance of treatment effects without making any assumptions about the underlying 
distribution of the outcome. 

In causal inference from finite population, two types of hypotheses are of interest: Fisher’s 
sharp null hypothesis of no treatment effect on any experimental unit (Fisher, 1935; Rubin, 1980), 
and Neyman’s null hypothesis of no average treatment effect (Neyman, 1923, 1935). These hy- 
potheses are equivalent without treatment effect heterogeneity (Ding et al., 2016) or equivalently 
under the assumption of strict additivity of treatment effects, i.e., the same treatment effect for 
each unit (Kempthorne, 1952). In the context of a multi-treatment completely randomized ex- 
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periment, Neyman’s null hypothesis allows for treatment effect heterogeneity, which is weaker 
than Fisher’s null hypothesis and is of greater interest. We find that the Fisher randomization test 
using the F' statistic can inflate the type I error under Neyman’s null hypothesis, when the sample 
sizes and variances of the outcomes under different treatment levels are negatively associated. 
We propose to use the X? statistic defined in §5, a statistic robust to treatment effect heterogene- 
ity, because the resulting Fisher randomization test is exact under Fisher’s null hypothesis and 
controls asymptotic type I error under Neyman’s null hypothesis. 


2. COMPLETELY RANDOMIZED EXPERIMENT WITH J TREATMENTS 


Consider a finite population of N experimental units, each of which can be exposed to any one 
of J treatments. Let Y;(7) denote the potential outcome (Neyman, 1923) of unit ¢ when assigned 
to treatment level j (i = 1,...,N;j =1,...,J). For two different treatment levels 7 and 7’, 
we define the unit-level treatment effect as 7;(7, 7’) = Yi(7) — Yi(y’), and the population-level 
treatment effect as 


N N 
tI) SNS AGI SG) HV. = OH, 
4=1 i=l 


where Y.(j7) = N7! i 1 Yi(J) is the average of the N potential outcomes for treatment j. 

The treatment assignment mechanism can be represented by the binary random variable 
W;(j), which equals 1 if the ith unit is assigned to treatment j, and 0 otherwise. Equivalently, 
it can be represented by the discrete random variable W; = ae gW;(7) € {1,..., J}, the 
treatment received by unit 7. Let (Wi,..., Wy) be the treatment assignment vector, and let 
(w1,...,wy) denote its realization. For the N = Sy N; units, (Nj,..., Ny) are assigned 
at random to treatments (1,...,J) respectively, the treatment assignment mechanism satisfies 
pr{(Wi,..., Ww) = (wi,..., ww) } = [1fy NGI/N! if 3, Wi(9) = Nj, and 0 otherwise. 
The observed outcomes are deterministic functions of the treatment received and the potential 
outcomes, given by Y,°°S = ee Wipe) = eds 


3. THE FISHER RANDOMIZATION TEST UNDER THE SHARP NULL HYPOTHESIS 


Fisher (1935) was interested in testing the following sharp null hypothesis of zero individual 
treatment effects: 


Ho(Fisher) : (1) =---=Y¥i(J), (¢=1,...,.N). 


Under Ho(Fisher), all the J potential outcomes Y;(1),..., Yi(J) are equal to the observed out- 
come Y,°S, for all units i = 1,..., N. Thus any possible realization of the treatment assignment 
vector would generate the same vector of observed outcomes. This means, under Ho(Fisher) 
and given any realization (W1,...,Ww) = (wi,...,ww), the observed outcomes are fixed. 
Consequently, the randomization distribution or null distribution of any test statistic, which is 
a function of the observed outcomes and treatment assignment vector, is its distribution over all 
possible realizations of the treatment assignment. The p-value is the tail probability measuring 
the extremeness of the test statistic with respect to its randomization distribution. Computa- 
tionally, we can enumerate or simulate a subset of all possible randomizations to obtain this 
randomization distribution of any test statistic and thus perform the Fisher randomization test 
(Fisher, 1935; Imbens & Rubin, 2015). Fisher (1925) suggested using the F’ statistic to test the 
departure from Ho(Fisher). Define Y.°°S(j) = N. Ge ae W;(j)Y;°°S as the sample average of 
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the observed outcomes within treatment level 7, and yoos — N-1 Sor 1 ye" as the sample aver- 
age of all the observed outcomes. Define s2,.(j) = (Nj; —1)7! ae Wilf) {VPs — VPS (7)? 
and s7,. = (N —1)71 ye (70s — Y.°°S)? as the corresponding sample variances with divisors 


N; — 1 and N — 1, respectively. Let 


J 
SSTre = S° Nj{¥S(j) _ yor 
j=l 
be the treatment sum of squares, and let 
J 7 J 
ssRes= > So {¥P — ¥(s) 2 = SG — 1) 52() 
I=L EW, (7)=1 gat 


be the residual sum of squares. The treatment and residual sums of squares sum up to the total 
sum of squares au (008 — Y0Ps)? = (N — 1)s2,.. The F statistic 


Fe SSTre/(J—1) _ MSTre (1) 
~ SSRes/(N — J) MSRes 


is defined as the ratio of the mean squares of treatment MSTre = SSTre/(J — 1) to the mean 
squares of residual MSRes = SSRes/(N — J). 

The distribution of (1) under Ho(Fisher) can be well approximated by an F'y_1,n—, distri- 
bution with degrees of freedom J — 1 and N — J, as is often used in the analysis of variance 
table obtained from fitting a normal linear model. Whereas it is relatively easy to show that (1) 
follows F'y_1,n—, if the observed outcomes follows a normal linear model drawn from a super 
population, arriving at such a result using a purely randomization-based argument is non-trivial. 
Below, we state a known result on the approximate randomization distribution of (1), in which 
we use the notation Ay ~ By to represent two sequences of random variables {Ay }9)_, and 
{Bn}%_, that have the same asymptotic distribution as N — oo. Throughout our discussion, 
we assume the following regularity conditions required by the finite population central limit 
theorem for causal inference (Li & Ding, 2017). 


Condition 1. As N —+ ox, for all j, Nj/N has a positive limit, Y.(j) and $?(j) have finite 
limits, and N-1 Max}<j<N Ya) _ Y.(j)|? — 0. 


THEOREM |. Assume Ho(Fisher). Over repeated sampling of (Wi,...,Wwy), the expec- 
tations of the residual and treatment sums of squares are E(SSTre) = (J —1)s2,, and 
E(SSRes) = (N — J)s2,., and as N —+ 0, the asymptotic distribution of (1) is 


ial =t1) 
{((N —1)—x3_i}H/(N — J) 


Remark 1. As N — o, both the statistic F’ and random variable F’7_;,j—j are asymptoti- 
cally aaa /(J — 1). The original F' approximation for randomization inference for a finite pop- 
ulation was derived by cumbersome moment matching between statistic (1) and the correspond- 
ing fy_1,n—, distribution (Welch, 1937; Pitman, 1938; Kempthorne, 1952). Similar to Silvey 
(1954), we provide a simpler proof based on the finite population central limit theorem in the 
Supplementary Material. 


Px 


~ Fy_-i,n-J- 


Remark 2. Under Ho(Fisher), the total sum of squares is fixed, but its components SSTre 
and SSRes are random through the treatment assignment (W1,..., Wy), and their expectations 
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are calculated with respect to the distribution of the treatment assignment. Also, the ratio of 
expectations of the numerator MSTre and denominator MSRes of (1) is 1 under Ho(Fisher). 


4. SAMPLING PROPERTIES OF THE F STATISTIC UNDER NEYMAN’S NULL HYPOTHESIS 


In Section 3, we discussed the randomization distribution, i.e., the sampling distribution under 
H (Fisher), of the F' statistic in (1). However, the sampling distribution of the F' statistic under 
Neyman’s null hypothesis of zero average treatment effect (Neyman, 1923, 1935), i.e., 

Ho(Neyman) : Y.(1) =---=Y.(J), 
is often of major interest but is under-investigated (Imbens & Rubin, 2015). Hop(Neyman) im- 
poses weaker restrictions on the potential outcomes than Ho(Fisher), making it impossible to 
compute the exact, or even approximate distribution of the F' statistic under Hp(Neyman). How- 
ever, analytical expressions for E(SSTre) and E(SSRes) can be derived under Ho(Neyman) 
along the lines of Theorem 1, and can be used to gain insights about the consequences of testing 
Ho(Neyman) using the Fisher randomization test with the F’ statistic in (1). 

For treatment level 7 =1,...,J, define p; = Nj/N as the proportion of the units, and 
$7(7) =(N-1)7! tv) — Y.(j)}? as the finite population variances of potential out- 
comes. Let Y.(-) = Spi) and S$? = a p;57(j) be the weighted averages of the fi- 
nite population means and variances. The sampling distribution of the F’ statistic in (1) depends 
crucially on the finite population variance of the unit-level treatment effects 


N 
829-7) = (N - 1)" {an - 7G, 97. 
i=1 


DEFINITION 1. The potential outcomes {Y;(j):i=1,...,N, j =1,...,J} have strictly 
additive treatment effects if for all 7 4 j', the unit-level treatment effects 7;(j, 7’) are the same 
fori =1,...,N, or equivalently, S?(j-j') = 0 for all j # 9’. 


Kempthorne (1955) obtained the following result on the sampling expectations of SSRes and 
SSTre for balanced designs with p; = 1/J under the assumption of strict additivity: 


J 
E(SSRes) = (N — J)S?, E(SSTre) = Se) -¥.()}? + (J -1)8". (2) 
j=l 


This result implies that with balanced treatment assignments and strict additivity, E(MSRes — 
MSTre) = 0 under Ho(Neyman), and provides a heuristic justification for testing Ho(Neyman) 
using the Fisher randomization test with the F’ statistic. However, strict additivity combined with 
Ho(Neyman) implies Ho (Fisher), for which this result is already known by Theorem 1. We will 
now derive results that do not require the assumption of strict additivity, and thus are more 
general than those in Kempthorne (1955). For this purpose, we introduce a measure of deviation 


from additivity. Let 
A= S 0S 6 pip 8? (5-9) 
j<i! 
be a weighted average of the variances of unit-level treatment effects. By Definition 1, A = 0 
under strict additivity. If strict additivity does not hold, i.e., there is treatment effect heterogeneity 
(Ding et al., 2016), then A # 0. Thus A is a measure of deviation from additivity and plays a 
crucial role in the following results on the sampling distribution of the F' statistic. 
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THEOREM 2. Over repeated de iat: of (W1,..., Ww), the expectation of the residual sum 
of squares is E(SSRes) = De (Nj — 1)S?(j), and the expectation of the treatment sum of 
squares is 

J J 
E(SSTre) = S~ N; {¥.G) - ¥.() ee (1 —p;)S2(7) — A, 
j=l j=l 


which reduces to E(SSTre) = aC — p;)S?(j) — A under Ho(Neyman). 


COROLLARY |. Under Ho(Neyman) with strict additivity in Definition 1, or, equivalently, 
under Ho(Fisher), the above results reduce to E(SSRes) = (N — J)S? and E(SSTre) = (J — 
1), which coincide with Theorem |. 


COROLLARY 2. For a balanced design with p; = 1/J, 
Na 7 
E(SSRes) = (N — J)S?, E(SSTre) = a SVG) - OP + (7-18? - A. 


Furthermore, under Hy(Neyman), E(SSRes) = (N — J).S? and E(SSTre) = (J —1)S?— A, 
implying that the difference between the mean squares of the residual and the treatment is 
E(MSRes — MSTre) = A/(J — 1) > 0. 


The result in (2) is a special case of Corollary 2 for A = 0. Corollary 2 implies that, for 
balanced designs, if the assumption of strict additivity does not hold, then testing Ho(Neyman) 
using the Fisher randomization test with the F' statistic may be conservative, in a sense that it 
may reject a null hypothesis less often than the nominal level. However, for unbalanced designs, 
the conclusion is not definite, as will be seen from the following result. 


COROLLARY 3. Under Hy(Neyman), the difference between the mean squares of the residual 
and the treatment is 


J 
E(MSRes — MSTre) = Cee S\(pj — J") 879) + oan 


Corollary 3 shows that the mean square of the residual may be bigger or smaller than that of the 
treatment, depending on the balance or lack thereof of the experiment and the variances of the 
potential outcomes. Under Ho(Neyman), when the p;’s and $?(j)’s are positively associated, 
the Fisher randomization test using F' tends to be conservative; when the p;’s and $?(j)’s are 
negatively associated, the Fisher randomization test using F' may not control correct type I error. 


5. A TEST STATISTIC THAT CONTROLS TYPE I ERROR MORE PRECISELY THAN F' 


To address the failure of the F' statistic to control type I error of the Fisher randomization test 
under Hy(Neyman) in unbalanced experiments, we propose to use the following X? test statistic 
for the Fisher randomization test. Define Q; = N; af Sops(J }), and define the weighted average of 
the sample means as Y,0°S = ae YG) oa Q;. Define the X? test statistic as 


J 
= OGY 3) 
j=l 
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which can be obtained from weighted least squares. This test statistic has been exploited in the 
classical analysis of variance literature (e.g., James, 1951; Welch, 1951; Johansen, 1980; Rice 
& Gaines, 1989; Weerahandi, 1995; Krishnamoorthy et al., 2007) based on the normal linear 
model with heteroskedasticity, and a similar idea called studentization has been adopted in the 
permutation test literature (e.g., Neuhaus, 1993; Janssen, 1997, 1999; Janssen & Pauls, 2003; 
Chung & Romano, 2013; Pauly et al., 2015). 

Clearly, replacing the F statistic by the X? statistic does not affect the validity of the 
Fisher randomization test for testing Ho(Fisher), because we always have an exact test for 
H (Fisher) no matter which test statistic we use. Moreover, we derive a new result showing 
that the Fisher randomization test using X? as the test statistic can also control the asymptotic 
type I error for testing Hy(Neyman). This means that the Fisher randomization test using X? as 
the test statistic can control the type I error under both Ho(Fisher) and Hp(Neyman) asymptoti- 
cally, making X? a more attractive choice than the classical F statistic for conducting the Fisher 
randomization test. Below, we formally state this new result. 


THEOREM 3. Under Ho(Fisher), the asymptotic distribution of X? is Kad as N —> oo. Un- 
der Hy(Neyman), the asymptotic distribution of X ? is stochastically dominated by emer ieé., 
for any constant a > 0, limy-soo pt(X? > a) < pr(x4_1 > a). 


Remark 3. Under Ho(Fisher), the randomization distribution of SSTre/s?,, follows 7_, 
asymptotically as shown in the Supplementary Material. Under Ho(Neyman), however, the 
asymptotic distribution of SSTre/ cee is not RG th and the asymptotic distribution of F’ is not 
Fy—J,j—1 aS suggested by Corollary 3. Fortunately, if we weight each treatment square by the 
inverse of the sample variance of the outcomes, the resulting X? statistic preserves the asymp- 
totic eae randomization distribution under Ho(Fisher), and has an asymptotic distribution that 


is stochastically dominated by x7,_, under Hy(Neyman). 


Therefore, under Ho(Neyman), the type I error of the Fisher randomization test using X? does 
not exceed the nominal level. Although we can perform the Fisher randomization test by enu- 
merating or simulating from all possible realizations of the treatment assignment, Theorem 3 
suggests that an asymptotic rejection rule against Ho(Fisher) or Hy(Neyman) is X? > x14, 
the 1 — a quantile of the Sa distribution. Because the asymptotic distribution of X? under 
Ho(Neyman) is stochastically dominated by Gar its true 1 — a quantile is asymptotically 
smaller than 1_,, and the corresponding Fisher randomization test is conservative in the sense 
of having smaller type I error than the nominal level asymptotically. 


Remark 4. This asymptotic conservativeness is not particular to our test statistic, but rather 
a feature of finite population inference (Neyman, 1923; Aronow et al., 2014; Imbens & Rubin, 
2015). It distinguishes Theorem 3 from previous results in the permutation test literature (e.g., 
Chung & Romano, 2013; Pauly et al., 2015), where the conservativeness did not appear and the 
correlation between the potential outcomes played no role in the theory. 


The form of X? in (3) suggests its difference from F' when the potential outcomes have differ- 
ent variances under different treatment levels. Otherwise we show that they are asymptotically 
equivalent in the following sense. 


CoroLiary 4. If $2(1) =--» = S?(J), then (J —1)F © X?. 


Under treatment effect additivity in Definition 1, the condition S?(1) = --- = $?(J) holds, 
and the equivalence between (J — 1)F and X? guarantees that the Fisher randomization tests 
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using F' and X? have the same asymptotic type I error and power. However, Corollary 4 is a 
large-sample result, and we evaluate it in finite samples in the Supplementary Material. 


6. SIMULATION 
6-1. Type l error of the Fisher randomization test using F 


In this subsection, we use simulation to evaluate the finite sample performance of the 
Fisher randomization test using F’ under Ho(Neyman). We consider the following three cases, 
where N’(j1, 07) denotes a normal distribution with mean j1 and variance a”. We choose signifi- 
cance level 0.05 for all tests. 

Case 1. For balanced experiments with sample sizes N = 45 and N = 120, we generate poten- 
tial outcomes under two cases: (1.1) ¥;(1) ~ N(0, 1), Y;(2) ~ (0, 1.27), ¥i(3) ~ N(0, 1.57); 
and (1.2) Y;(1) ~ N(0,1), Yi(2) ~ N(0, 27), Y;(3) ~ N(0, 32). These potential outcomes are 
independently generated, and standardized to have zero means. 

Case 2. For unbalanced experiments with sample sizes (1, No, N3) = (10, 20,30) and 
(Ni, No, N3) = (20, 30,50), we generate potential outcomes under two cases: (2.1) Y;(1) ~ 
N (0,1), ¥i(2) = 2¥;(1), ¥i(3) = 8Y;(1); and (2.2) ¥;(1) ~ (0, ), ¥i(2) = 3¥;(1), ¥i(3) = 
5Y;(1). These potential outcomes are standardized to have zero means. In this case, pj < po < p3 
and.S?(1);< S72) $78). 

Case 3. For unbalanced experiments with sample sizes (1, No, N3) = (30,20, 10) and 
(Ni, No, N3) = (50, 30, 20), we generate potential outcomes under two cases: (3.1) Y;(1) ~ 
N (0,1), ¥i(2) = 2Y;(1), ¥i(3) = 3Y;(1); and (3.2) ¥i(1) ~ M(0, ), ¥i(2) = 3Y;(1), ¥i(3) = 
5Y; (1). These potential outcomes are standardized to have zero means. In this case, pj > po > p3 
andS*(1) <$7(2) < S73). 

Once generated, the potential outcomes are treated as fixed constants. Over 2000 simulated 
randomizations, we calculate the observed outcomes, and then perform the Fisher randomization 
test using F to approximate the p-values by 2000 draws of the treatment assignment. The his- 
tograms of the p-values are shown in Figures 1(a)—1(c) corresponding to cases 1-3 above. We 
also report the rejection rates associated with these cases along with their standard errors in the 
next few paragraphs. 

In Figure 1(a), the Fisher randomization test using F' is conservative with p-values distributed 
towards 1. With larger heterogeneity in the potential outcomes, the histograms of the p-values 
have larger masses near 1. For case (1.1), the rejection rates are 0.010 and 0.018, and for case 
(1.2), the rejection rates are 0.023 and 0.016, for sample sizes N = 45 and N = 120 respectively, 
with all Monte Carlo standard errors no larger than 0.003. 

In Figure 1(b), the sample sizes under each treatment level are increasing in the variances 
of the potential outcomes. The Fisher randomization test using F' is conservative with p-values 
distributed towards 1. Similar to Figure 1(a), with larger heterogeneity in the potential outcomes, 
the p-values have larger masses near 1. For case (2.1), the rejection rates are 0.016 and 0.014, 
and for case (2.2), the rejection rates are 0.015 and 0.011, for sample sizes N = 45 and N = 120 
respectively, with all Monte Carlo standard errors no larger than 0.003. 

In Figure 1(c), the sample sizes under different treatment levels are decreasing in the variances 
of the potential outcomes. For case (3.1), the rejection rates are 0.133 and 0.126, and for case 
(3.2), the rejection rates are 0.189 and 0.146, for sample sizes N = 45 and N = 120 respectively, 
with all Monte Carlo standard errors no larger than 0.009. The Fisher randomization test using 
F does not preserve correct type I error with p-values distributed towards 0. With larger hetero- 
geneity in the potential outcomes, the p-values have larger masses near 0. 
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Fig. 1: Histograms of the p-values under Hp(Neyman) based on the Fisher randomization tests 
using F’, with grey histogram and white histograms for the first and second sub-cases. 


These empirical findings agree with our theory in Section 4, i.e., if the sample sizes under 
different treatment levels are decreasing in the sample variances of the observed outcomes, then 
the Fisher randomization test using F’ may not yield correct type I error under Ho(Neyman). 


6-2. Type I error of the Fisher randomization test using X? 

Figure 2 shows the same simulation as Figure 1, but with test statistic X?. 

Figure 2(a) shows a similar pattern as Figure 1(a). For case (1.1), the rejection rates are 0.016 
and 0.012, and for case (1.2), the rejection rates are 0.014 and 0.010, for sample sizes N = 45 
and N = 120 respectively, with all Monte Carlo standard errors no larger than 0.003. 

Figure 2(b) shows better performance of the Fisher randomization test using X? than Figure 
1(b), with p-values closer to uniform. For case (2.1), the rejection rates are 0.032 and 0.038, and 
for case (2.2), the rejection rates are 0.026 and 0.030, for sample sizes N = 45 and N = 120 
respectively, with all Monte Carlo standard errors no larger than 0.004. 

Figure 2(c) shows much better performance of the Fisher randomization test using X? than 
Figure 1(c), because the p-values are much closer to uniform. For case (3.1), the rejection rates 
are 0.052 and 0.042, and for case (3.2), the rejection rates are 0.048 and 0.040, for sample 
sizes N = 45 and N = 120 respectively, with all Monte Carlo standard errors no larger than 
0.005. This agrees with our theory that the Fisher randomization test using X? can control the 
asymptotic type I error under Hp(Neyman). 


6:3. Power comparison of the Fisher randomization tests using F and X? 


In this subsection, we compare the powers of the Fisher randomization tests using F and X? 
under alternative hypotheses. We consider the following cases. 

Case 4. For balanced experiments with sample sizes N = 30 and N = 45, we generate po- 
tential outcomes from Y;(1) ~ N(0, 1), Yi(2) ~ N(0, 27), Y;(3) ~ MN (0,32). These potential 
outcomes are independently generated, and shifted to have means (0, 1, 2). 

Case 5. For unbalanced experiments with sample sizes (N,, No, N3) = (10, 20,30) and 
(Ni, No, Nz) = (20, 30,50), we first generate Y;(1) ~ (0,1) and standardize them to have 


A randomization-based perspective of analysis of variance 9 


wo 
- | AL 
> | See DOG Lees lll ee ee > af =ay = 
Be ae Tp Ee ‘a im 2 2 e | 
oO oO wo oO Oo 
To 3 oO a ne} 4 
2 2 | 2 
2 T T Lo) T T 2 T T 
0.0 02 04 06 08 1.0 0.0 02 04 06 08 1.0 0.0 02 04 06 08 1.0 
N=45 Balanced N=60, (N1,N2,N3)=(10,20,30) N=60, (N1,N2,N3)=(30,20,10) 
wo 
- | AL 
> | > © I____----~ > | _ a e rT 
Py  hosen cote at ey rm G7 B © 
Al — = p= 4 
oO oO wo oO i=) 
>} mo} ol me} 4 
° ° o 
2 T T ° T T 2 T T 
0.0 02 04 06 08 1.0 0.0 02 04 06 08 1.0 0.0 02 04 06 08 1.0 
N=120 Balanced N=100, (N1,N2,N3)=(20,30,50) N=100, (N1,N2,N3)=(50,30,20) 
(a) Balanced experiments, case 1 (b) Unbalanced experiments, case 2 (c) Unbalanced experiments, case 3 


Fig. 2: Histograms of the p-values under Ho(Neyman) based on the Fisher randomization tests 
using X?, with grey histogram and white histograms for the first and second sub-cases. 


mean zero, and we then generate Y;(2) = 3Y;(1) +1 and Y;(3) = 5Y;(1) + 2. In this case, 
Pi < po < p3 and $?(1) < $?(2) < S2(3). 

Case 6. For unbalanced experiments with sample sizes (1, No, N3) = (30,20, 10) and 
(Ni, No, N3) = (50, 30, 20), we generate potential outcomes the same as the above case 5. In 
this case, p; > p2 > p3 and $?(1) < $?(2) < $?(3). 

Over 2000 simulated data sets, we perform the Fisher randomization test using F and X? and 
obtain the p-values by 2000 draws of the treatment assignment. The histograms of the p-values, 
in Figures 3(a)—3(c), correspond to cases 4-6 above. The Monte Carlo standard errors for the 
rejection rates below are all close but no larger than 0.011. 

For case 4, the rejection rates using X? and F are 0.290 and 0.376 respectively with sample 
size N = 30, and 0.576 and 0.692 respectively with sample size N = 45. For case 5, the powers 
using X? and F are 0.178 and 0.634 respectively with sample size N = 60, and 0.288 and 
0.794 respectively with sample size N = 100. Therefore, when the experiments are balanced or 
when the sample sizes are positively associated with the variances of the potential outcomes, the 
Fisher randomization test using F has larger power than that using X?. 

For case 6, the rejection rates using X? and F are 0.494 and 0.355 respectively with sample 
size N = 60, and 0.642 and 0.576 respectively with sample size N = 100. Therefore, when 
the sample sizes are negatively associated with the variances of the potential outcomes, the 
Fisher randomization test using F' has smaller power than that using X?. 


6-4. Simulation studies under other distributions and practical suggestions 
In the Supplementary Material, we give more numerical examples. First, we conduct simula- 
tion studies in parallel with 886-1-6-3 with outcomes generated from exponential distributions. 
The conclusions are nearly identical to those in 886-1-6-3, because the finite population cen- 
tral limit theorems holds under mild moment conditions without imposing any distributional 
assumptions. 
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(a) Balanced experiments, case 4 (b) Unbalanced experiments, case 5 (c) Unbalanced experiments, case 6 


Fig. 3: Histograms of the p-values under alternative hypotheses based on the Fisher randomiza- 
tion tests using F and X°, with grey histograms for X? and white histograms for F. 


Second, we use two numerical examples to illustrate the conservativeness issue in Theorem 3. 
Third, we compare different behaviors of the Fisher randomization tests using F and X? in two 
real-life examples. 


7. DISCUSSION 


As shown in the proofs of Theorems 1 and 3 in the Supplementary Material, we need 
to analyze the eigenvalues of the covariance matrix of {Y.'(1),..., Y°>S(.J)} to obtain the 
properties of F and X? for general J > 2. Moreover, we consider the case with J = 2 
to gain more insights and to make connections with existing literature. For 7 4 7’, an un- 
biased estimator for T(j, 7’) is 7(J, 9’) = Y.S(j) — Y.°S(j’), which has sampling variance 
var{7(j, j’)} = S2(7)/Nj + S29) /Nj — S?(9-7")/(N;j + Nj) and an conservative variance 
estimator s2,.(j)/Nj + $24.(7’)/Nj (Neyman, 1923). 


obs 


COROLLARY 5. When J = 2, the F and X°? statistics reduce to 


Fe me hs x2 = Toy) 
s24s(1)/No a 8245(2)/N1 8345(1)/N1 ec 834,,(2)/No’ 


where the approximation of F is due to ignoring the difference between N and N — 2 and the 
difference between Nj and N; — 1 (j = 1,2). Under Ho(Fisher), F ~ x7 and X? ~ x2. Under 
Ho(Neyman), F ~ Cyy7 and X? ~ C2x2, where 


eh var{7(1,2)} lim var{7(1,2)} 
N->-+00 S?(1)/No2 + S?(2)/M1 , N—--+00 S?(1)/M, + S?(2)/No = 
Depending on the sample sizes and the finite population variances, C; can be either larger than 
or smaller than 1. Consequently, using F’ in the Fisher randomization test can be conservative 
or anti-conservative under Ho(Neyman). In contrast, C’z is always no larger than 1, and there- 
fore using X? in the Fisher randomization test is conservative for testing Hp(Neyman). Neyman 


Cy = (4) 


Ci= 
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(1923) proposed to use the square root of X? to test Ho(Neyman) based on a normal approxi- 
mation, which is asymptotically equivalent to the Fisher randomization test using X?. Both are 
conservative unless the unit-level treatments are constant. 

In practice, for treatment-control experiments, the difference-in-means statistic 7(1,2) was 
widely used in the Fisher randomization test (Imbens & Rubin, 2015), which, however, can be 
conservative or anti-conservative for testing Ho(Neyman) as shown in Gail et al. (1996), Lin 
et al. (2017) and Ding (2017) using numerical examples. We formally state this result below, 
recognizing the equivalence between 7(1, 2) and F' in a two-sided test. 


COROLLARY 6. When J = 2, the two-sided Fisher randomization test using 7 (1,2) is equiv- 
alent to using 


a) ee 7 (2) 
Ns2.g/(NiN2) — 824.(1)/No + 83,,(2)/Ni + 72(1, 2)/N’ 


where the approximation is due to ignoring the difference between (N,N, —1, Nz —1) and 
(N,N, No). Under Ho(Fisher), T? ~ F ~ x2, and under Hy(Neyman), T? ~ F ~ C1 x7 with 
C' defined in (A). 


Remark 5. Analogously, under the super population model, Romano (1990) showed that the 
Fisher randomization test using 7(1,2) can be conservative or anti-conservative for testing the 
hypothesis of equal means of two samples. Janssen (1997, 1999) and Chung & Romano (2013) 
suggested using the studentized statistic, or equivalently X, to remedy the problem of possibly 
inflated type I error, which is asymptotically exact under the super population model. 


After rejecting either Ho(Fisher) or Ho(Neyman), it is often of interest to test pair- 
wise hypotheses, i.e., for 7 4 7’, Yi(j) = Yi(j’) for all i, or ¥-(j) = Y.(j’). According to 
Corollaries 5 and 6, we recommend using the Fisher randomization test with test statistic 
729, 9’) /{s255(7)/Ny + 824,(9’)/Njv}, which will yield conservative type I error even if the ex- 
periment is unbalanced and the variances of the potential outcomes vary across treatment groups. 

The analogue between our finite population theory and Chung & Romano (2013)’s super pop- 
ulation theory suggests that similar results may also hold for layouts of higher order and other 
test statistics (Pauly et al., 2015; Chung & Romano, 2016a,b; Friedrich et al., 2017). In more 
complex experimental designs, often multiple effects are of interest simultaneously, raising the 


problem of multiple testings (Chung & Romano, 2016b). We leave these to future work. 
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Supplementary Materials 


§S8 presents the proofs, 8S9 contains examples, and §S10 gives additional simulation. 


S8. PROOFS 


To prove the theorems, we need the following lemmas about completely randomized experi- 
ments. 


LEMMA S1. The treatment assignment indicator W;(j) is a Bernoulli random variable with 
mean p; = Nj;/N and variance p;(1 — p;). The covariances of the treatment assignment indi- 
cators are 


cov{Wi(j),We(d)} =—pi(1—p)/(N-1), #2) 
cov{ Wil), Wilg!)} = —pypy (#7) 
cov{Wi(9), Wi(9’)} = pip /(N — 1), GA, #7’). 


Proof of Lemma S1. The proof is straightforward. 


_ LEMMA 82. Assume (c1,...,¢n) and (d1,...,dNn) are two fixed vectors with means ¢ and 
d, finite population variances S? and ce The finite population covariance is S.q = (S2 +S 7 — 
Se eo, where Ss. is the finite population variance of (c, — d1,...,cn — dn). For j #7, 


1 - 1l—p; 1 a 1 S, 
var 4 —— Wil) e = J oe cov < — W;(j)ci, — Wi(j’) di at ed 
{fr > N; N; > Wee N 


Proof of Lemma S2. Lemma S2 is known, and its special forms appeared in Kempthorne 
(1955). We give an elementary proof for completeness. Applying Lemma S1, we have 


J j=1 
N 
- we Wils)(ci- a} 
J i=l 
N 
= wi do var {Wil He — 2) — SS J cov{Wil3), Wid) Her — O(c — 2) 
J i=1 iki! 
1 Jy pj(1 — p,) 
= ay) Pill — Pali — 2 — DOS AG — Olce — 2) 
J i=1 ii! 
N 
~ {2 Dj) SG c)? eer SiG _ °° | 
: i=1 i=1 


14 PENG DING AND TIRTHANKAR DASGUPTA 


For j 4 j’, applying Lemma S1 again, we have 


ieee ati = re 
a> Gen a7 ah 
1 = ad 7 
= ae |e Ue—a.omoria a 


N 
= NiNp { Sosowtth) Wilj')}(ce — 2) (di - @) 


+> Sreovg ils), Wis") Her — 8) (dy — A) 


if! 
ge 28 z PjPj! 2 7 
~ NjNj -Ynnv ance a dd Noe Oe 
VV 

1 N pipe N 

rs Nj Ny Ny {Py SG e)(d; — d) 4 ee SiG — (di - a 
i=l i=1 

= —S.a/N. 

Proof of Theorem 1. Under Ho(Fisher), {Y°°S : i = 1,..., N} and SSTot = (N — 1)s?,. are 
fixed. Because {Y,°>’ : W;(j) = 1} is a simple random sample from the finite population {Y,°°° : 
i=1,...,N}, the sample mean Y.°S(j) is unbiased for the population mean YS, and the 
sample variance ee (j) is unbiased for the population variance ee Therefore, 

J 

E(SSRes) = EW, Sops (J j)} = SN; ae 1)s5 Sobs = (N — J) 85 Sobs> 
j=l 
which further implies that 
E(SStTre) = SSTot — E(SSRes) = (N — 1)s%,, — (N — J) 8245 = (J — 1)s%4¢. 
Applying Lemma S2, we have 
v7 obs / bs Pj 62 obs obs = Bobs 
var YG) = Sones COMA) OT ae (SS) 


Nj 
Therefore, the finite population central limit theorem (Li & Ding, 2017, Theorem 5), coupled 
with the variance and covariance formulae in (S5), implies 


ee Lape sip pi - er 


1/2 ¢yrobs yobs 1/2,.1/2 1/2 di 
No!“ {Y°°S(2) — Y. j= ae 
a ie { 5 } Ny | 0, 825. hi ee Po Py ; 
voy = 1/2172 1/21 /2 
Ny! er) ey —py py! —p py: eT py 


where VV; denotes a J-dimensional normal random vector. The above asymptotic covariance 


matrix can be simplified as 52, (17 — qq’) = s2,,P., where I, is the J x J identity matrix, and 


1/2 2x 


q = (p;'",.--,pj7)*. The matrix P is a projection matrix of rank J — 1, which is orthogonal to 
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the vector g. Consequently, the treatment sum of squares can be represented as SSTre = V7V ~ 
ae ee and the F statistic can be represented as 


= Stes) = XI oos/ (J = 1) 
{(N — 1)s¢4, — SSTre}/(N — J) {(N ~ 1) 8045 — X5-18cpst/(N — J) 
xj-1/(J — 1) 


OD ey ee 


Proof of Theorem 2. First, because YS (j )) = yn 1 Wild) Yi(Z)/Nj, Lemma S2 implies that 
Y.°S(j) has mean Y.(j) and variance (1 — p;)S?(j)/N;, and 


N 
cov{ Ys (7), ¥°S(7’)} = cov ts y Wi(A) Vi), Ye x ye muni} 
i=1 


= —sASG) + PU") — PGI}. 


Therefore, 


J 
ar(¥%) = S> pivar{¥.%(3)} + > > pjpyoov{¥(j), ¥°(')} 


j=l FAS 
De, 2 ape 1 
= Py) — DD ava a(S) + 2!) — PGF) 
j=l 3 AS 


= DPil (1 — p;)S?( (J) 


SE pir 2s )-S EY pire 2 £5 Drip S2 (j-7') 


JAI JAI JAI 


Because 


> > Pip S7(5) - Sn (1 — p;)S°(9), 


JAI 


SoS pypy 8 (3 = Sm 1— py )S. -Sn Lp, )5. cas 


JAI 


the variance of YS reduces to 


var (¥.°5) = (2N)* SOS 7 pipy8?(5-4') = A/N. 


JAS 
Second, 
cov{¥S(j i), Yrs} — Pj var{ YS(7) (j)} + S © pycov{ VP (j i), Yor 2) 
VAI 


= (1 ~p,)$2() - an PSU (1) + 92") — 20-7}. 
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We further define }7 52; pS? (9-7) = Because 


S > py S79) = (1— py) 879), S58?) = 8? — 0879), 
VAI VAI 


the covariance between Y°S(j) and YS reduces to 
cov{¥(7), YO} = ae * {2(1 — pj)S?(9) — (1 = pj) S79) — 8? + pjS7(5) + AG} 
Ny" {8?G) — 8° + AG}. 
Third, Y.°S(7) — i has mean Y.(j) — yo p;Y.(j) and variance 
var {Ys (7) =< yore — var{ Ys (j)} at: var(Ys) — Qeov{ VS (j i), Saas 
1 


Le Di edi: 27. 2 } 
= — {TP gt 4 A 92(7)4. 52a, }. 
{este (i) 


Finally, the expectation of the treatment sum of squares is 
E(SSTre) apy N; “i {V2PS(j) iy obs 


J 
=O )— driv Ap + doy { Bsa +a- $(j) +S? - as} 
7=1 J: 


j=1 


which follows from the mean and variance formulas of Y.°>S(j) — Y.°>S. Some algebra gives 


E(SSTre) = SN 


2 


+ S71 = p;)S?(7) +A - 8? +8? — 2A 
j=l 


J 
Ds 
j=l 
J 5 J 
— So p¥.9) > +> 50 —p,)S?(9) - A. 


j=l j=l 
Under Ho(Neyman), ie., Y.(1) =--- = Y.(J), or, equivalently, Y.(j) — Se piY.(j) =0 
for all 7, the expectation of the treatment sum of squares further reduces to 
J 
B(SSTre) = S “(1 — p,;)$?(7) — A. 
j=l 


Because {Y,°°S : W;(j) = 1} is a simple random sample from {Yi (es 
sample variance is unbiased for the population variance, i.e., E{s?,.(j)} = S? 
the mean of the residual sum of squares is 


Deana thé 
(7). “Therefore, 


E(SSRes) = E {(Nj — 1)s255(3)} = 5_ (Nj — 1)5?(J). 


This completes the proof. 


Proof of Corollary 1. Additivity implies S$? = $?(j) for all 7 and A = 0, and the conclusions 
follow. 
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Proof of Corollary 2. For balanced designs, p; = 1/J,Nj; = N/J and S$? = a S7(9)/J, 
and therefore Theorem 2 implies 


J 
E(SSRes) = “—* S- 97(7) = (N - J)S?, 


J 
E(SSTre) = ~ SMG) - FOP +(7- ps? -A. 
j=l 


Moreover, under Hj(Neyman), E(SSRes) is unchanged, and E(SSTre) = (J —1)$?—A. 
Therefore, the expectation of the mean treatment squares is no larger than the expectation of 
the mean residual squares, because E(MSRes) — E(MSTre) = A/(J — 1) > 0. 


Proof of Corollary 3. Under Ho(Neyman), 


J 
E(MSRes) — E(MSTre) = ES 5) s+ 5 
j=l 
N-1)J 2, A 


To prove Theorem 3, we need the following two lemmas: the first is about the quadratic form 
of the multivariate normal distribution, and the second, due to Schur (1911), provides an upper 
bound for the largest eigenvalue of the element-wise product of two matrices. The proof of the 
first follows from straightforward linear algebra, and the proof of the second can be found in 
Styan (1973, Corollary 3). Below we use A * B to denote the element-wise product of A and B, 
i.e, the (2, j)-th element of A * B is the product of the (7, j)-th elements of A and B, Aj; Bj;. 


LEMMA 83. If X ~ N7(0, A), then XTBX ~ = r;&;, where the ;’s are iid x7, and the 
Aj’s are eigenvalues of BA. 


LEMMA S4. If A is positive semidefinite and B is a correlation matrix, then the maximum 
eigenvalue of A x B does not exceed the maximum eigenvalue of A. 


Proof of Theorem 3. We first prove the result under Ho(Neyman), and then view the result 
under Ho(Fisher) as a special case. 

Let Q;= N;/S?(j) for j=1,...,J, and Q= ee Q; be their sum. Define q/, = 
( ee oe Qi?) /QU?, and Py = Ij — qwq}, is a projection matrix of rank J — 1. Let Ys = 
Qui oy Qs j) be a weighted average of the means of the observed outcomes. Accord- 
ing to Li & Ding (2017, Proposition 3), s2,.(7) — S?(j) — 0 in probability (j = 1,..., J). By 
Slutsky’s Theorem, X? has the same asymptotic distribution as 


J 
XB = S705 {FG — Pap} 
j=l 


Define p;;, as the finite population correlation coefficient of potential outcomes {Y;(j)}*®_, and 
{Yi(k)};2,, and R as the corresponding correlation matrix with (j, k)-th element p;;,. The finite 


18 PENG DING AND TIRTHANKAR DASGUPTA 
population central limit theorem (Li & Ding, 2017, Theorem 5) implies 


Qy/?¢7s(1) — ¥.()} 
Qs! {¥s(2) — ¥.(2)} 


Y= 
1/2 5-3 = 
Oy {v(J) = X.(J)} 
1/2 1/2 1/2 1/2 
ae i —p;! py! Pi2°** —p;! py Pls 
1/2. 1/2 1/2. 1/2 
aN, 10 =P Py p21 1— pg 1h py Py P2s _P«R 
1/2.1/2 1/2 1/2 
| _p pi! ps1 _p;/ Dal piers Leapy | 


where P = I; — qq' is the projection matrix defined in the proof of Theorem 1. In the above, 
the mean and covariance matrix of the random vector Vo follow directly from Lemmas S1 and 
82. 


Under Ho(Neyman) with Y.(1) = --. = Y.(J), we can verify that 
2 
J _ 7 1 J 7 7 
XG = OfK™() - KAP - 5 OAL O=V 4 
j=l 71 


which can be further rewritten as a quadratic form (cf. Chung & Romano, 2013) 
XG = Vo" (La — quay) Vo = Vo' Pw Vo. 


According to Lemma S3, x has asymptotic distribution ee Aj&j, where the \j;’s are the J — 
1 nonzero eigenvalues of P,,(P *« R). The summation is from 7 = 1 to J — 1 because P,,(P * R) 
has rank at most J — 1. The eigenvalues (A1,..., \.y—1) are all smaller than or equal to the largest 
eigenvalue of P * R, because Py is a projection matrix. According to Lemma S4, the maximum 
eigenvalue of the element-wise product P « R is no larger than the maximum eigenvalue of 
P, which is 1. Therefore, XG ~ ey Aj&j, where A; < 1 for all 7. Because the a ae can be 
represented as €; + --- + &j_1, it is clear that the asymptotic distribution of Xx? is stochastically 
dominated by oe 

When performing the Fisher randomization test, we treat all observed outcomes as fixed, and 
consequently, the randomization distribution is essentially the repeated sampling distribution of 
X? under Y;(1) = --- = Yi(J) = Y,°°S. This restricts $?(j) to be constant, and the correlation 
coefficients between potential outcomes to be 1. Correspondingly, P,, = P, R = 11%, and the 
asymptotic covariance matrix of Vo is P. Applying Lemma S3 again, we know that the asymp- 
totic randomization distribution of X? is Noes because PP = P has J — 1 nonzero eigenvalues 
and all of them are 1. 

Mathematically, the randomization distribution under Ho(Fisher) is the same as the permu- 
tation distribution. Therefore, applying Chung & Romano (2013) yields the same result for X? 
under Ho(Fisher). 


Proof of Corollary 4. As shown in the proof of Theorem 3, X? is asymptotically equivalent to 
X{%, and therefore we need only to show the equivalence between (J — 1)F and X@. If S?(1) = 
ves SJ) = 6, then Yor = 7 Pt and 


yoy aaa) See _ SSTre 
S? Saag 
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Because MSRes = yeas — 1)s2,,(7)/(N — J) converges to S? in probability (Li & Ding, 
2017, Proposition 3), Slutsky’s Theorem implies 


SSTre . SSTre 
MSRes S2 


(J-1)F = 


Therefore, (J —1)F ~ X@ < X?. 
Proof of Corollary 5. First, we discuss F. Because Y.°°S = p; ¥°S(1) + p2¥°S(2), we have 
yen) = yos = pot (1, ayy yor) = yous = —pif (1, 2). 


The treatment sum of squares reduces to 
SSTre = N, {¥°(1) — VO}? + Ny {¥°(2) — ¥%}? = Npipo??(1, 2), 


and the residual sum of squares reduces to SSRes = (Ni — 1)s2,.(1) + (No — 1)s2,,(2). There- 
fore, the F' statistic reduces to 


es SSTre _ a7 (122) 7 G12) 


SSRes/(N — 2) gH Ths Sns(1) + CDN NG Sebw(2) obs 1)/No + $45(2)/Na 


where the approximation follows from ignoring the difference between N and N — 2 and the dif- 
ference between N; and N; — 1 (j = 1,2). Following from Theorem | or proving it directly, we 
know that F ~ F\ y—2 ~ x7 under Ho (Fisher). However, under Ho (Neyman), Neyman (1923), 
coupled with the finite population central limit theorem (Li & Ding, 2017, Theorem 5), imply 


a1; 2) 


~ N(0,1), 
{sa $2(2)  §2(1-2) ae (0,1) 
Nir aN N 
and s?,.(j) — S?(j) in probability (j = 1,2). Therefore, the asymptotic distribution of F under 


Ho(Neyman) is F ~ C,x?, where 


__$2(1)/Ny + S2(2)/N2 — $?(1-2)/N 
N-+-400 S2(1)/No + S2(2)/Ny 


Second, we discuss X?. Because 


7 Ny = No = N. N: 
| 5 1 VON 5 2 rma / 4 1 2 \, 


Sops (1) Sops (2) S() 82,,(2) 
we have 
= = N: N: N: 
P() - 7 = ery / {otal 
Sops (2) Sips (1) Sips (2) 
~ = MN. Ni No 
pas a ane ae =— 70,2)/ 4 + \. 
= 834¢(1) a (D) 824,(2) 
Therefore, the X? statistic reduces to 
Ni Ne No N?2 N. No )? 
2 1 2-2 2 142 at 2 
el Os ory Oy lea eo) 
sobs sobs sobs sobs sobs sobs 


_ 77 (1,2) 
a 825(1)/N1 oF 82,.(2)/No : 
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Following from Theorem 3 or proving it directly, we know that X? ~ y? under Ho(Fisher). 
However, under Ho(Neyman), we can use an argument similar to that for F and obtain X? ~ 
C2x?, where 


C= |i S?(1)/N, + S?(2)/No — $?(1-2)/N 4 
emma S2(1)/N, + $2(2)/No an 


The constant C2 is smaller than or equal to 1 with equality holding if the limit of $?(1-2) is zero, 
Le., the unit-level treatment effects are constant asymptotically. 


Proof of Corollary 6. In the Fisher randomization test, ce is fixed, and therefore using 7 (1, 2) 


is equivalent to using T?. Using simple algebra similar to Ding (2017), we have the following 
decomposition 
(N — 1)8oo5 = (Ni — 1)8eo5(1) + (No — 1)Sops(2) + Ni No? (1, 2)/N, 


obs 


which implies the equivalent formula of T? in Corollary 6. Under Ho(Fisher) or Ho(Neyman), 
7(1, 2) — Oin probability, which coupled with Slutsky’s Theorem, implies the asymptotic equiv- 
alence T? ~ F. 


S9. NUMERICAL EXAMPLES 


Example S1. We consider J = 3, sample sizes Ny = 120, No = 80 and N3 = 40. We gener- 
ate the first set of potential outcomes from 


Yi(1) ~ N(0, 1), ¥i(2) = 38¥i(1), Yi(3) = 5Yi(1), (S6) 
and the second set of potential outcomes from 
¥i(1) ~ N(0; 1), ¥i(2) ~ NO, 3°), Vi(3) ~ N(0, 5"). (S7) 


After generating the potential outcomes, we center the Y;(j)’s by subtracting the mean to make 
Y.(j) = 0 for all 7 so that Hy(Neyman) holds. Figure $4 shows the distributions of X? over 
repeated sampling of the treatment assignment vector (W1,..., Wy) for potential outcomes 
generated from (S6) and (S7). The true sampling distributions under both cases are stochastically 
dominated by \3. Under (S6), the correlation coefficients between the potential outcomes are 1; 
whereas under (S7), the correlation coefficients are 0. With less correlated potential outcomes, 
the gap between the true distribution and 3 becomes larger. 


Example S2. We use an example from Montgomery (2000, Exercise 3.15) with 4 treatment 
levels. The sample variances and the sample sizes differ for the four treatment levels, as shown 
in Table S1. The p-values of the Fisher randomization test using F' and X? are 0.003 and 0.010, 
respectively. If we choose a stringent size, say a = 0.01, then the evidence against the null 
is strong from the first test, but the evidence is weak from the second test. If our interest is 
Ho(Neyman), then the different strength of evidence may be due to the different variances and 
sample sizes of the treatment groups. Because of this, we recommend making decision based on 
the Fisher randomization test using X?. 


Example S3. We reanalyze the data from Angrist et al. (2009), which contain a control group 
and 3 treatment groups designed to improve academic performance among college freshmen. 
Table S2 summaries the sample sizes, means and variances of the final grades under 4 treat- 
ment groups. The p-values of the Fisher randomization test using F and X? are 0.058 and 
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Fig. S4: Distributions of X?. The histograms are the sampling distributions, the dotted lines are 
the asymptotic distributions, and the solid lines are the 3 distribution. 


Table S1: A randomized experiment with J = 4 


1 2 3 4 
observed outcome 58.2 56.3 50.1 52.9 
57.2 54.5 54.2 49.9 
58.4 57.0 55.4 50.0 


55.8 55.3 51.7 
54.9 
sample size 2) 4 3 4 
mean 56.9 55.8 53.2 51.1 
variance 2.3 1.2 7.7 2.1 


Table $2: A randomized experiment with J = 4, where control, sfp, ssp and sfsp denote the four 
treatment groups. 


control sfp ssp sfsp 

sample size 854 219 212 119 
mean 63.86 65.83 64.13 66.10 
variance 144.97 124.45 159.76 114.33 


0.045, respectively. The Fisher randomization tests using F and X°? give different conclu- 
sions at the commonly used significance level of 0.05. In this unbalanced experiment, the 
Fisher randomization test using F' is less powerful. 
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S10. MORE SIMULATION WITH NONNORMAL OUTCOMES 
$10-1. Type I error of the Fisher randomization test using F 

In this subsection, we use simulation to evaluate the finite sample performance of the 
Fisher randomization test using F’ under Ho(Neyman). We consider the following three cases, 
where € denotes an exponential distribution with mean 1. 

Case S1. For balanced experiments with sample sizes N = 45 and N = 120, we generate po- 
tential outcomes under two cases: (S1.1) Y;(1) ~ €, Y;(2) ~ €/0.7, Y;(3) ~ €/0.5; and (S1.2) 
Y;(1) ~ €, Y;(2) ~ €/0.5, Yi(3) ~ €/0.3. These potential outcomes are independently gener- 
ated, and standardized to have zero means. 

Case S2. For unbalanced experiments with sample sizes (Ni, No, N3) = (10, 20,30) and 
(Ni, No, N3) = (20, 30, 50), we generate potential outcomes under two cases: (S2.1) Y;(1) ~ €, 
¥;(2) = 2Y;(1), ¥;(3) = 3Y;(1); and ($2.2) ¥i(1) ~ €, ¥i(2) = 3Y;(1), ¥i(3) = 5Y;(1). These 
potential outcomes are standardized to have zero means. In this case, pj < p2 < p3 and $?(1) < 
§2(2) <S7(3), 

Case S3. For unbalanced experiments with sample sizes (Ni, No, N3) = (30, 20,10) and 
(N,, No, N3) = (50, 30, 20), we generate potential outcomes under two cases: (S3.1) Y;(1) ~ €, 
Y;(2) = 1.2¥;(1), ¥;(3) = 1.5Y;(1); and (S3.2) ¥;(1) ~ €, ¥i(2) = 1.5Y;(1), ¥i(3) = 2¥;(1). 
These potential outcomes are standardized to have zero means. In this case, py > pa > p3 and 
87 (1) <57(2) <8? 3): 

We follow 86-1 and obtain the same conclusions about the Fisher randomization test using F’, 
because Figures 1 and S5 exhibit the same pattern. 

In Figure 5(a), for case (S1.1), the rejection rates are 0.022 and 0.014, and for case (S1.2), 
the rejection rates are 0.030 and 0.030, for sample sizes N = 45 and N = 120 respectively. 
In Figure 5(b), for case (S2.1), the rejection rates are 0.018 and 0.024, and for case (2.2), the 
rejection rates are 0.026 and 0.018, for sample sizes N = 45 and N = 120 respectively. The 
Monte Carlo standard errors are all close to but no larger than 0.003. 

In Figure 5(c), for case (S3.1), the rejection rates are 0.076 and 0.086, and for case 
(S3.2), the rejection rates are 0.108 and 0.109, for sample sizes N = 45 and N = 120 re- 
spectively, with all Monte Carlo standard errors no larger than 0.008. In these two cases, the 
Fisher randomization test using F’ does not preserve correct type I error. 


$10-2. Type I error of the Fisher randomization test using X? 

We follow 86-2, generate the same data as §S10-1, and obtain the same conclusions about the 
Fisher randomization test using X”, because Figures 2 and S6 exhibit the same pattern. All the 
Monte Carlo standard errors of the rejection rates below are close but no larger than 0.005. 

In Figure 6(a), for case (S1.1), the rejection rates are 0.034 and 0.018, and for case (S1.2), the 
rejection rates are 0.048 and 0.029, for sample sizes N = 45 and N = 120 respectively. In Figure 
6(b), for case (S2.1), the rejection rates are 0.032 and 0.035, and for case (S2.2), the rejection 
rates are 0.025 and 0.036, for sample sizes N = 45 and N = 120 respectively. In Figure 6(c), 
for case (S3.1), the rejection rates are 0.060 and 0.062, and for case (S3.2), the rejection rates are 
0.054 and 0.044, for sample sizes N = 45 and N = 120 respectively. This, coupled with Figure 
S5, agrees with our theory that the Fisher randomization test using X7 can control type I error 
under Hg(Neyman) better than using F’. 


§10-3. Power comparison of the Fisher randomization tests using F and X? 


We follow §6-3 to compare the powers of the Fisher randomization tests using F and X?. We 
consider the following cases and summarize the results in Figure S7. 
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Fig. S5: Histograms of the p-values under Hp(Neyman) based on the Fisher randomization tests 
using X?, with grey histogram and white histograms for the first and second sub-cases. 
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Fig. S6: Histograms of the p-values under Hp(Neyman) based on the Fisher randomization tests 
using X?, with grey histogram and white histograms for the first and second sub-cases. 


Case S4. For balanced experiments with sample sizes N = 30 and N = 45, we generate po- 
tential outcomes from Y;(1) ~ €, Y;(2) ~ €/0.7, Y;(3) ~ €/0.5. These potential outcomes are 
independently generated, and shifted to have means (0, 0.5, 1). 

Case S5. For unbalanced experiments with sample sizes (Ni, No, N3) = (10, 20,30) and 
(Ni, No, N3) = (20, 30,50), we first generate Y;(1) ~ € and standardize them to have mean 
zero, and we then generate Y;(2) = 3Y;(1) + 1 and Y;(3) = 5Y;(1) + 2. In this case, pi < po < 
p3 and $?(1) < $2(2) < $?(3). 
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Fig. S7: Histograms of the p-values under alternative hypotheses based on the Fisher randomiza- 
tion tests using F and X°, with grey histograms for X? and white histograms for F. 


Case S6. For unbalanced experiments with sample sizes (Ni, No, N3) = (30, 20,10) and 
(N,, No, N3) = (50, 30, 20), we generate potential outcomes the same as the above case S5. 
In this case, p1 > po > p3 and $?(1) < $?(2) < $?(3). 

When the sample sizes are positively associated with the variances of the potential outcomes, 
the Fisher randomization test using F' has larger power than that using X?. However, when the 
treatment groups are balanced or when the sample sizes are negatively associated with the vari- 
ances of the potential outcomes, the Fisher randomization test using F’ has smaller power than 
that using X?. We report the rejection rates below with all the Monte Carlo standard errors no 
larger than 0.01. 

For case S4, the rejection rates using X? and F are 0.087 and 0.066 with sample size N = 30, 
and 0.207 and 0.198 with sample size N = 45. For case S5, the powers using X? and F are 0.044 
and 0.106 with sample size N = 60, and 0.293 and 0.729 with sample size N = 100. For case 
S6, the rejection rates using X? and F are 0.211 and 0.037 with sample size N = 60, and 0.578 
and 0.274 with sample size N = 100. 


$10-4. Finite sample evaluation of Corollary 4 with skewed outcomes 


We first generate log-normal potential outcomes Y;(1) ~ exp{N(0,1)}, Yi(2) ~ 
exp{N(1,1)}, and Y;(3) ~ exp{A/(2,1)}, and then standard them to have equal finite 
population means 0 and variances 1. 

Under Ho(Neyman), the p-values of the Fisher randomization test using F and X? are shown 
in Figure S8(a). With sample size (Ny, N2, N3) = (10, 10, 10), the rejection rates using X? and 
F are 0.012 and 0.016; with sample size (10, 15,20), the rejection rates are 0.016 and 0.028; 
with sample size (20, 15, 10), the rejection rates are 0.006 and 0.015. The Monte Carlo standard 
errors are all close to but no larger than 0.004. 

Under alternative hypotheses, the p-values of the Fisher randomization test using F' and X? 
are shown in Figure $8(b). With sample size (Nj, N2, N3) = (10, 10, 10), we shift the potential 
outcomes by constants (0, 0.5, 1), and the rejection rates using X? and F are 0.514 and 0.512; 


A randomization-based perspective of analysis of variance 25 


ire) roy ro} 


= boca EP ait Jeet ae S4t-------g----- asa |— me [et JE) | Qsctceceeccoc sok ae wee PE he |) 


density 
density 
density 


05 
] 
05 
05 


0.0 
0.0 
0.0 


0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 


N=30 Balanced N=45, (N1,N2,N3)=(10,15,20) N=45, (N1,N2,N3)=(20, 15,10) 


(a) Ho(Neyman) holds 


density 
density 
density 


| Aliana ~ LE TT oate ~ EL pete oor i 


0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 


N=30 Balanced N=45, (N1,N2,N3)=(10,15,20) N=45, (N1,N2,N3)=(20,15,10) 


(b) Ho(Neyman) does not hold 


Fig. S8: Histograms of the p-values under equal finite population variances based on the Fisher 
randomization tests using F and X?, with grey histograms for X? and white histograms for F’. 


with sample size (10, 15, 20), we shift the potential outcomes by constants (0, 0.2, 0.5), and the 
rejection rates are 0.164 and 0.215; with sample size (20, 15, 10), we shift the potential outcomes 
by constants (0, 0.2, 0.5), and the rejection rates are 0.256 and 0.179. The Monte Carlo standard 
errors are all close but no larger than 0.011. 

In finite samples, we observe moderate difference between the Fisher randomization tests 
using X? and F even with homoskedastic potential outcomes, although Corollary 4 ensures 
their asymptotic equivalence. 


